Multiagent off-screen behavior prediction in football

In multiagent worlds, several decision-making individuals interact while adhering to the dynamics constraints imposed by the environment. These interactions, combined with the potential stochasticity of the agents’ dynamic behaviors, make such systems complex and interesting to study from a decision-making perspective. Significant research has been conducted on learning models for forward-direction estimation of agent behaviors, for example, pedestrian predictions used for collision avoidance in self-driving cars. In many settings, only sporadic observations of agents may be available in a given trajectory sequence. In football, subsets of players may come in and out of view of broadcast video footage, while unobserved players continue to interact off-screen. In this paper, we study the problem of multiagent time-series imputation in the context of human football play, where available past and future observations of subsets of agents are used to estimate missing observations for other agents. Our approach, called the Graph Imputer, uses past and future information in combination with graph networks and variational autoencoders to enable learning of a distribution of imputed trajectories. We demonstrate our approach on multiagent settings involving partially observable players, using the Graph Imputer to predict the behaviors of off-screen players. To quantitatively evaluate the approach, we conduct experiments on football matches with ground truth trajectory data, using a camera module to simulate the off-screen player state estimation setting. We subsequently use our approach for downstream football analytics under partial observability using the well-established framework of pitch control, which traditionally relies on fully observed data. We illustrate that our method outperforms several state-of-the-art approaches, including those hand-crafted for football, across all considered metrics.

Football-specific role-invariant VRNN baseline architecture.
The input manipulations conducted in this model, which are detailed in the Baseline Model Details section of the Supplementary Information, ensure that outputs are invariant to permutations of the player order within each timestep. However, this model also makes the strongest assumption about the domain at hand, namely the interaction of two teams of players, along with a singleton entity (the ball), in a shared environment.
We provide an overview of this model in Fig. S1. At each timestep, the model passes the player and ball observations through LSTMs with shared parameters. The hidden states from the LSTMs are then summed within each team, producing team-level contextual vectors; note that this operation ensures permutation-invariance within each team. The ball and each team's hidden state are then passed through an MLP, producing a vector representing the overall game-level context. Finally, to produce predictions for individual players, this game-level context is concatenated with each player's individual LSTM state to produce a player-specific context, which is then passed through an MLP-based VAE, enabling sampling of players' next-states.
The process iterates autoregressively as in the Graph Imputer, and can likewise be run in reverse temporal order to produce and fuse bidirectional player state estimates. This model is trained in the same manner as the Graph Imputer, using the ELBO (15).
In addition to this model, we also include in Table S2 a variant called 'Role-invariant RNN', which simply replaces the VAE head in Fig. S1 with an MLP.
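The role-invariance argument above can be illustrated with a minimal Python sketch. It is not the model from Fig. S1: `encode_entity` is a hypothetical fixed map standing in for the shared-parameter LSTM, and plain concatenation stands in for the MLP that produces the game-level context. All function names are assumptions made for illustration. The only point shown is that summing per-player encodings within a team makes the context independent of player order.

```python
import math
from typing import List, Sequence, Tuple

State = Tuple[float, float]  # (x, y) position of one entity
Vector = List[float]


def encode_entity(state: State) -> Vector:
    """Hypothetical stand-in for the shared-parameter LSTM encoder."""
    x, y = state
    return [math.tanh(0.5 * x + 0.1 * y), math.tanh(-0.2 * x + 0.3 * y)]


def sum_pool(hidden_states: Sequence[Vector]) -> Vector:
    """Elementwise sum of hidden states; the result ignores input order."""
    return [sum(components) for components in zip(*hidden_states)]


def game_context(team_a: Sequence[State], team_b: Sequence[State],
                 ball: State) -> Vector:
    """Team-level sums and the ball encoding, concatenated.

    The source passes these through an MLP; concatenation is a placeholder.
    """
    context_a = sum_pool([encode_entity(p) for p in team_a])
    context_b = sum_pool([encode_entity(p) for p in team_b])
    return context_a + context_b + encode_entity(ball)


def player_context(player: State, team_a: Sequence[State],
                   team_b: Sequence[State], ball: State) -> Vector:
    """Game-level context concatenated with the player's own encoding."""
    return game_context(team_a, team_b, ball) + encode_entity(player)


team_a = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
team_b = [(-1.0, 0.0), (2.0, 2.0)]
ball = (0.0, 0.0)

original = game_context(team_a, team_b, ball)
shuffled = game_context(list(reversed(team_a)), team_b, ball)
# Reordering players within a team leaves the context unchanged
# (up to floating-point rounding in the sum).
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(original, shuffled))
```

Note that the two teams still occupy fixed slots in the context vector, which reflects the two-teams-plus-ball assumption the text describes.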

Additional Hyperparameter Details & Computational Resources
In addition to the key hyperparameters detailed in the main paper, we also ran sweeps for VAE-based models in which a standard normal prior distribution was used (in lieu of a learned prior), as is also commonly done in VAE approaches. For the Social LSTM model, we also ran sweeps over grid widths of 8, 24, and 64, which set the size of the neighbor grid for each player (in meters); larger grid sizes correspond to increasing amounts of neighbor context on the football pitch.
We use a cluster of Tesla V100 and P100 GPUs for training and evaluation, respectively. Overall, our sweeps were conducted over a set of 435 independent training runs (i.e., each with a unique hyperparameter set and random seed). Depending on the complexity of the underlying model (the simplest being the autoregressive LSTM, and the most complex being the Graph Imputer), each training run took approximately 3 to 15 hours of wall-clock time.

Additional Sweeps and Baselines
Table S2 presents additional comparative sweeps for the football off-screen player state estimation scenario. In addition to the results in the main paper, this table includes the Role-invariant RNN baseline detailed in the Baseline Model Details section. The Role-invariant RNN model achieves performance quite similar to that of its Role-invariant VRNN counterpart, with the main distinction being that the former is deterministic whereas the latter is stochastic; in certain applications, the ability to resample the model (or, e.g., to tune the KL-regularization weight β in (15) to increase or decrease the level of stochasticity in the model) can be quite useful from a practical perspective.
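To make the role of the KL-regularization weight β concrete, the following Python sketch computes a generic β-weighted negative ELBO for diagonal Gaussian posterior and prior distributions. It is a sketch under stated assumptions, not a reproduction of (15): the unit-variance Gaussian reconstruction term and the names `gaussian_kl` and `beta_weighted_loss` are assumptions introduced here. Setting the prior to mean 0 and standard deviation 1 corresponds to the standard normal prior considered in the hyperparameter sweeps.

```python
import math
from typing import Sequence


def gaussian_kl(mu_q: Sequence[float], sigma_q: Sequence[float],
                mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def beta_weighted_loss(prediction: Sequence[float], target: Sequence[float],
                       mu_q: Sequence[float], sigma_q: Sequence[float],
                       mu_p: Sequence[float], sigma_p: Sequence[float],
                       beta: float) -> float:
    """Negative ELBO with the KL term scaled by beta.

    Assumption: reconstruction is a unit-variance Gaussian log-likelihood,
    i.e. half the squared error up to an additive constant.
    """
    reconstruction = 0.5 * sum((p - t) ** 2 for p, t in zip(prediction, target))
    return reconstruction + beta * gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)


# Posterior N(1, 1) against a standard normal prior N(0, 1): KL = 0.5.
kl = gaussian_kl([1.0], [1.0], [0.0], [1.0])
loss = beta_weighted_loss([1.0, 2.0], [0.0, 2.0], [1.0], [1.0], [0.0], [1.0], beta=2.0)
```

In this formulation, a larger β pulls the posterior toward the prior, while a smaller β weights reconstruction more heavily; how this trades off against sample stochasticity in the actual models is reported in the source, not derived here.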
Additionally, Table S2 includes sweeps over the bidirectional fusion modes ((13) and (14) in the Methods section of the main text). For all bidirectional models, we observe that the nearest-observation weighted fusion mode (14) yields the lowest evaluation loss, primarily because it modulates the weighting of the directional updates (which deviate further from the ground truth the longer they have gone without an observation).

Table S2: Football off-screen player state estimation results. We separate models into two categories: restricted models (those that apply only to the football setting, as they process data in a manner explicitly assuming two teams of players, along with a ball), and general models (models that apply to general multiagent prediction settings). The columns refer to the following. Skip connection: whether a skip connection from the input to the decoder is enabled for autoencoder-based models. Next-step conditional decoder: whether decoders in graph network-based models condition on available next-timestep observations as additional context. Bidir. fusion mode: the fusion mode used for bidirectional models, where 'mean' corresponds to (13) in the main text, and 'nearest' to (14). For each baseline model, we compute the mean evaluation loss, L2 (Mean), compared to the ground truth trajectories (over all seeds). For stochastic models, for each evaluation sequence we also take 6 samples of imputed trajectories and report the minimum evaluation loss, L2 (Min.), over all samples, averaged over all seeds.
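The difference between the two fusion modes can be sketched in Python as follows. Equations (13) and (14) are not reproduced in this section, so the weighting used for the 'nearest' mode is an assumption: each direction is weighted in inverse proportion to the number of timesteps since it last saw an observation. The function names are likewise hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def steps_since_last_observation(observed: Sequence[bool]) -> List[Optional[int]]:
    """Timesteps elapsed since the most recent observation (None if none yet)."""
    gaps: List[Optional[int]] = []
    last_seen: Optional[int] = None
    for t, seen in enumerate(observed):
        if seen:
            last_seen = t
        gaps.append(None if last_seen is None else t - last_seen)
    return gaps


def forward_weight(gap_fwd: Optional[int], gap_bwd: Optional[int]) -> float:
    """Assumed 'nearest' weighting: the more recently informed direction dominates."""
    if gap_fwd is None and gap_bwd is None:
        return 0.5  # no observations at all: fall back to the plain mean
    if gap_fwd is None:
        return 0.0  # only the backward pass has seen an observation
    if gap_bwd is None:
        return 1.0  # only the forward pass has seen an observation
    if gap_fwd + gap_bwd == 0:
        return 0.5  # observed timestep: both directions are equally fresh
    return gap_bwd / (gap_fwd + gap_bwd)


def fuse(forward: Sequence[Point], backward: Sequence[Point],
         observed: Sequence[bool], mode: str = "nearest") -> List[Point]:
    """Fuse forward and backward estimates with 'mean' or 'nearest' weighting."""
    observed = list(observed)
    gaps_fwd = steps_since_last_observation(observed)
    # The backward pass scans the sequence in reverse temporal order.
    gaps_bwd = steps_since_last_observation(observed[::-1])[::-1]
    fused: List[Point] = []
    for t, (f, b) in enumerate(zip(forward, backward)):
        w = 0.5 if mode == "mean" else forward_weight(gaps_fwd[t], gaps_bwd[t])
        fused.append((w * f[0] + (1 - w) * b[0], w * f[1] + (1 - w) * b[1]))
    return fused


observed = [True, False, False, False, True]
forward = [(0.0, 0.0)] * 5   # forward pass stuck at its last observation
backward = [(4.0, 4.0)] * 5  # backward pass stuck at the next observation
nearest = fuse(forward, backward, observed, mode="nearest")
mean = fuse(forward, backward, observed, mode="mean")
```

With these toy inputs, the 'nearest' estimate moves smoothly from the forward value toward the backward value across the unobserved gap, whereas the 'mean' estimate stays at the midpoint throughout.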

As mentioned earlier, we anticipate that situations with increased partial observability will further compound the errors associated with standard interpolation techniques. To investigate this, we generated a new dataset, reducing the camera's (horizontal, vertical) field of view from (45°, 30°) to (30°, 20°). Under this new camera model, on average, 8.51 ± 3.36 players (out of 22) are in-frame in each sequence, with a consecutive in-frame duration of 3.76 s ± 3.09 s; for comparison, these quantities were respectively 12.76 ± 3.70 players in-frame and 4.94 s ± 3.49 s in the original dataset reported in the main paper, illustrating a notable decrease in observability. Retraining the models on this new dataset yields the performance metrics reported in Table S1 (shown for the best hyperparameters of each model). In this new setting, in which players are out of view for longer periods of time, the bidirectional Social LSTM now outperforms the spline-based baselines. However, our Graph Imputer model continues to substantially outperform all other models, which provides further evidence of the robustness of our approach.
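The effect of narrowing the field of view can be illustrated with a simplified Python sketch. The camera module used in the experiments is not specified in this section, so everything below is an assumption: the test is purely two-dimensional, uses only the horizontal field of view, and ignores the vertical angle, camera height, and lens effects. The camera position, look-at point, and player coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def in_horizontal_fov(camera: Point, look_at: Point, player: Point,
                      hfov_deg: float) -> bool:
    """True if the player lies within half the horizontal FOV of the view axis."""
    view = (look_at[0] - camera[0], look_at[1] - camera[1])
    to_player = (player[0] - camera[0], player[1] - camera[1])
    cross = view[0] * to_player[1] - view[1] * to_player[0]
    dot = view[0] * to_player[0] + view[1] * to_player[1]
    angle = abs(math.atan2(cross, dot))  # unsigned angle off the view axis
    return angle <= math.radians(hfov_deg) / 2


def count_in_frame(camera: Point, look_at: Point, players: Sequence[Point],
                   hfov_deg: float) -> int:
    """Number of players visible under the simplified 2D camera model."""
    return sum(in_horizontal_fov(camera, look_at, p, hfov_deg) for p in players)


camera = (0.0, -40.0)   # hypothetical camera position beside the pitch
look_at = (0.0, 0.0)    # hypothetical point the camera is aimed at
players = [(0.0, 10.0), (5.0, 0.0), (12.0, 0.0), (20.0, 0.0)]

wide = count_in_frame(camera, look_at, players, hfov_deg=45.0)
narrow = count_in_frame(camera, look_at, players, hfov_deg=30.0)
```

In this toy configuration, the player at (12, 0) sits roughly 16.7° off the view axis, so it is in-frame at a 45° horizontal field of view but out-of-frame at 30°.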

Additional Trajectory Visualizations
We provide a number of additional visualizations of trajectory predictions for the Graph Imputer and additional baselines in Figs. S2 to S5. Here we also include a variant of the Graph Imputer which attains high trajectory sequence variance, which can be useful from a downstream analytical perspective when higher sample stochasticity is desired.
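One way to quantify sample stochasticity alongside accuracy is sketched below in Python. It computes, for a set of sampled trajectories, the mean and minimum error over samples (in the spirit of the L2 (Mean) and L2 (Min.) columns of Table S2) together with a simple spread measure. The exact loss definition used in the paper is not restated here, so the per-timestep Euclidean distance and the variance-based spread are assumptions, as are all function names.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def trajectory_l2(predicted: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean distance between predicted and true positions."""
    return sum(math.dist(p, q) for p, q in zip(predicted, truth)) / len(truth)


def sample_statistics(samples: Sequence[Trajectory],
                      truth: Trajectory) -> Dict[str, float]:
    """Mean and minimum error over samples, plus positional variance across samples."""
    errors = [trajectory_l2(sample, truth) for sample in samples]
    n = len(samples)
    spread = 0.0
    for t in range(len(truth)):
        mean_x = sum(sample[t][0] for sample in samples) / n
        mean_y = sum(sample[t][1] for sample in samples) / n
        # Average squared distance of the samples from their mean position at t.
        spread += sum((sample[t][0] - mean_x) ** 2 + (sample[t][1] - mean_y) ** 2
                      for sample in samples) / n
    return {
        "l2_mean": sum(errors) / n,
        "l2_min": min(errors),
        "spread": spread / len(truth),
    }


truth = [(0.0, 0.0), (1.0, 0.0)]
samples = [
    [(0.0, 0.0), (1.0, 0.0)],  # matches the ground truth exactly
    [(0.0, 2.0), (1.0, 2.0)],  # offset by 2 m in y at every timestep
]
stats = sample_statistics(samples, truth)
```

A model with higher sample variance will typically show a larger spread and a wider gap between the mean and minimum errors, since a more diverse set of samples is more likely to contain one close to the ground truth.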

Additional Details on Related Works
Table S3 provides an additional cross-sectional overview of the works most closely related to ours. In this table, we summarize models that consider prediction of trajectories, detailing whether or not they are stochastic, consider the interactions of multiple agents in the system, target the imputation problem (as opposed to the typical forward-prediction setting), and use both forward and backward information. Some of the models in this table are related to ours, although they target slightly different problem regimes. For example, in Naomi [39], the considered dataset regime is distinct from ours in that, at each timestep, either all players are simultaneously observed or all are unobserved. By contrast, we consider situations where a subset of players is observed (while others are unobserved) at any given timestep; this is the scenario encountered in the off-screen player tracking problem targeted herein, where some players may be visible on-screen while others are off-screen. This distinction enables the approach of Naomi to essentially treat the multiagent observation x_t at each time t as a single, high-dimensional input. Indeed, that is the primary distinction from our graph network-based approach, where the decompositionality afforded by the graph structure enables our model to treat mixed-observability settings. Similarly, in Baller2vec++ [30], the introduced model uses multiagent information, though it does not target the imputation setting considered here (where we consider distant future observations that are available for a subset of agents, with heterogeneous temporal gaps in observed data). The key contribution of their work, rather, is to use the probability chain rule to condition the generated trajectories of one agent on the generated trajectories of other agents, inducing better-correlated predictions, which is indeed an important feature to capture in sports-based models.
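The two observability regimes contrasted above can be made concrete with a small Python sketch. The mask layout (`mask[t][k]` is true when agent k is observed at timestep t) and the function name are assumptions introduced for illustration; neither comes from the cited works.

```python
from typing import Sequence


def observability_regime(mask: Sequence[Sequence[bool]]) -> str:
    """Classify a mask where mask[t][k] is True if agent k is observed at time t.

    'joint': at every timestep, all agents are observed or all are unobserved.
    'mixed': at some timestep, only a subset of agents is observed.
    """
    for row in mask:
        if any(row) and not all(row):
            return "mixed"
    return "joint"


# All-or-nothing observability at each timestep.
joint_mask = [
    [True, True, True],
    [False, False, False],
    [True, True, True],
]

# Off-screen setting: some agents visible while others are not.
mixed_mask = [
    [True, False, True],
    [True, True, False],
    [False, False, True],
]

joint_label = observability_regime(joint_mask)
mixed_label = observability_regime(mixed_mask)
```

Under a joint mask, the whole multiagent observation at a timestep can be treated as one input that is either present or missing; under a mixed mask, observability must be tracked per agent, which is the case the graph-structured decomposition is described as handling.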